Getting started with the Google Genomics API

In this notebook we'll cover how to make authenticated requests to the Google Genomics API.


NOTE:

  • If you're new to notebooks, or want to check out additional samples, check out the full list of general notebooks.
  • For additional Genomics samples, check out the full list of Genomics notebooks.

Setup

Install Python libraries

We'll be using the Google Python API client for interacting with Genomics API. We can install this library, or any other 3rd-party Python libraries from the Python Package Index (PyPI) using the pip package manager.

There are 50+ Google APIs that you can work against with the Google Python API Client, but we'll focus on the Genomics API in this notebook.


In [1]:
!pip install --upgrade google-api-python-client


Requirement already up-to-date: google-api-python-client in /usr/local/lib/python2.7/dist-packages
Cleaning up...

Create an Authenticated Client

Next we construct a Python object that we can use it to make requests.

The following snippet shows how we can authenticate using the service account on the Datalab host. For more detail about authentication from Python, see Using OAuth 2.0 for Server to Server Applications.


In [2]:
from httplib2 import Http
from oauth2client.client import GoogleCredentials
credentials = GoogleCredentials.get_application_default()
http = Http()
credentials.authorize(http)


Out[2]:
<httplib2.Http at 0x7f6433997a90>
Out[2]:
<httplib2.Http at 0x7f6433997a90>

And then we create a client for the Genomics API.


In [3]:
from apiclient.discovery import build
genomics = build('genomics', 'v1', http=http)

Send a request to the Genomics API

Now that we have a Python client for the Genomics API, we can access a variety of different resources. For details about each available resource, see the python client API docs here.

Using our genomics client, we'll demonstrate fetching a Dataset resource by ID (the 1000 Genomes dataset in this case).

First, we need to construct a request object.


In [4]:
request = genomics.datasets().get(datasetId='10473108253681171589')

Next, we'll send this request to the Genomics API by calling the request.execute() method.


In [5]:
response = request.execute()

You will need enable the Genomics API for your project if you have not done so previously. Click on this link to enable the API in your project.

The response object returned is simply a Python dictionary. Let's take a look at the properties returned in the response.


In [6]:
for entry in response.items():
    print "%s => %s" % entry


projectId => genomics-public-data
id => 10473108253681171589
createTime => 1970-01-01T00:00:00.000Z
name => 1000 Genomes

Success! We can see the name of the specified Dataset and a few other pieces of metadata.

Accessing other Genomics API resources will follow this same set of steps. The full list of available resources within the API is here. Each resource has details about the different verbs that can be applied (e.g., Dataset methods).

Access Data

In this portion of the notebook, we implement this same example implemented as a python script. First let's define a few constants to use within the examples that follow.


In [7]:
dataset_id = '10473108253681171589' # This is the 1000 Genomes dataset ID
sample = 'NA12872'
reference_name = '22'
reference_position = 51003835

Get read bases for a sample at specific a position

First find the read group set ID for the sample.


In [8]:
request = genomics.readgroupsets().search(
  body={'datasetIds': [dataset_id], 'name': sample},
  fields='readGroupSets(id)')
read_group_sets = request.execute().get('readGroupSets', [])
if len(read_group_sets) != 1:
  raise Exception('Searching for %s didn\'t return '
                  'the right number of read group sets' % sample)

read_group_set_id = read_group_sets[0]['id']

Once we have the read group set ID, lookup the reads at the position in which we are interested.


In [9]:
request = genomics.reads().search(
  body={'readGroupSetIds': [read_group_set_id],
        'referenceName': reference_name,
        'start': reference_position,
        'end': reference_position + 1,
        'pageSize': 1024},
  fields='alignments(alignment,alignedSequence)')
reads = request.execute().get('alignments', [])

And we print out the results.


In [10]:
# Note: This is simplistic - the cigar should be considered for real code
bases = [read['alignedSequence'][
           reference_position - int(read['alignment']['position']['position'])]
         for read in reads]

print '%s bases on %s at %d are' % (sample, reference_name, reference_position)

from collections import Counter
for base, count in Counter(bases).items():
  print '%s: %s' % (base, count)


NA12872 bases on 22 at 51003835 are
C: 1
G: 13

Get variants for a sample at specific a position

First find the call set ID for the sample.


In [11]:
request = genomics.callsets().search(
  body={'variantSetIds': [dataset_id], 'name': sample},
  fields='callSets(id)')
resp = request.execute()
call_sets = resp.get('callSets', [])
if len(call_sets) != 1:
  raise Exception('Searching for %s didn\'t return '
                  'the right number of call sets' % sample)

call_set_id = call_sets[0]['id']

Once we have the call set ID, lookup the variants that overlap the position in which we are interested.


In [12]:
request = genomics.variants().search(
  body={'callSetIds': [call_set_id],
        'referenceName': reference_name,
        'start': reference_position,
        'end': reference_position + 1},
  fields='variants(names,referenceBases,alternateBases,calls(genotype))')
variant = request.execute().get('variants', [])[0]

And we print out the results.


In [13]:
variant_name = variant['names'][0]
genotype = [variant['referenceBases'] if g == 0
            else variant['alternateBases'][g - 1]
            for g in variant['calls'][0]['genotype']]

print 'the called genotype is %s for %s' % (','.join(genotype), variant_name)


the called genotype is G,G for rs131767